Who Spoke What? A Latent Variable Framework 
Abstract —In this paper, we present a latent variable (LV) 
framework to identify all the speakers and their keywords given 
a multi-speaker mixture signal. We introduce two separate LVs to 
denote active speakers and the keywords uttered. The dependency 
of a spoken keyword on the speaker is modeled through a 
conditional probability mass function. The distribution of the 
mixture signal is expressed in terms of the LV mass functions 
and speaker-specific-keyword models. The proposed framework 
admits stochastic models, representing the probability density 
function of the observation vectors given that a particular speaker 
uttered a specific keyword, as speaker-specific-keyword models. 
The LV mass functions are estimated in a Maximum Likelihood 
framework using the Expectation Maximization (EM) algorithm. 
The active speakers and their keywords are detected as modes 
of the joint distribution of the two LVs. In mixture signals 
containing two speakers uttering keywords simultaneously, 
the proposed framework achieves an accuracy of 82% for 
detecting both speakers and their respective keywords, using 
Student’s-t mixture models as speaker-specific-keyword models. 

Index Terms —GMMs, tMMs, latent variable, 


I. Introduction 

Human conversations quite often have multiple people talking at the same time. Automatic processing of such conversations is essential in the context of human-machine interaction (HMI), as it enables the machine to aid humans better. Consider, for example, a home environment wherein multiple people require the machine to do different things. In such a scenario it becomes important for the machine to understand who spoke what. This requires the machine to be able to identify multiple speakers, recognize speech from multiple speakers, and associate the recognized speech streams with the corresponding speakers. 

The problem of speaker recognition has effective solutions [1]-[8], which are robust to reverberation [9], environmental noise [10], large populations [11], etc. However, recognizing a target speaker in a multi-speaker scenario is a harder problem with fewer solutions [12]. An even more challenging problem is that of identifying multiple speakers in a multi-speaker scenario [13]. 

Although there are robust algorithms and systems for speech recognition in a single-speaker scenario [14]-[23], recognizing speech in a multi-speaker scenario is still a challenge. In order to compare different approaches to recognizing speech from a target speaker in a multi-speaker scenario, Cooke et al. constituted the “Monaural Speech separation and recognition challenge” [24], wherein sentences from several speakers were recorded with a restricted grammar. The task was to recognize the letter and digit spoken by a speaker who spoke the word “white” in a multi-speaker mixture signal, wherein the target speaker is masked by another speaker uttering a similar sentence without the word “white”. Although unrealistic with a restricted sentence grammar, the task was still challenging, with only a few approaches able to surpass human performance [12], [25] (albeit in a sub-category). In [26]-[28], the authors address the problem of identifying the target speaker and his keyword in a multi-speaker mixture signal. As noted by the authors, the performance is generally poor at a signal-to-interference ratio (SIR) of 0 dB. 

Thus, in general, the task of recognizing speech from multiple speakers is a complex problem. Moreover, to address the problem of who spoke what, the recognized speech segments must also be associated with their respective speakers, which is an even more complicated task. 

As noted in [29], the use of a restricted vocabulary of keywords may be more suitable for a task-driven application, like HMI in a home environment, where the primary concern is to control the machine for the efficient completion of a task. Therefore, the problem is now formulated as detecting which of the known speakers uttered which of the known keywords. The goal is to identify all the speakers and their keywords, rather than the keyword from a single target speaker. 

In this paper, we propose to address the above task in a Latent Variable (LV) framework. We associate one LV to denote the active speaker and another LV to denote the keyword uttered, and relate the two LVs through a conditional dependency. The probability mass function (p.m.f.) and the conditional p.m.f. of the LVs are estimated using speaker-specific-keyword models. In order to evaluate the proposed framework in relevance to our goal of HMI in a home environment, we have created our own database.¹ With Student’s-t Mixture Models (tMMs) as speaker-specific-keyword models, the proposed approach is able to detect at least one speaker-keyword pair, in mixture signals with two speakers, with an accuracy of 99%, and both speaker-keyword pairs with an accuracy of 82%. The contributions of this paper are: (i) formulation of the problem of identifying who spoke what in an LV framework (Sec. II); (ii) casting the problem of LV density estimation in a maximum likelihood (ML) framework and solving it using an Expectation-Maximization (EM) algorithm (Sec. II-A); (iii) experimental evaluation of the proposed approach on a newly collected database meant for HMI in a home environment (Sec. III). 

¹ The entire database is available for research purposes only, on contacting the corresponding author. 

II. Proposed Latent Variable (LV) Formulation 

Let $S_k$ denote the $k$th speaker in the set $\mathcal{S} = \{S_1, S_2, \ldots, S_M\}$ of $M$ known speakers. The mixture signal $x[n]$ contains speech from a subset of speakers from the set $\mathcal{S}$. In a given signal $x[n]$, a speaker is said to be active if he/she speaks and passive otherwise. We assume that the speakers only utter keywords from the vocabulary $\mathcal{V} = \{V_1, V_2, \ldots, V_N\}$. Let $X$ denote the features estimated from $x[n]$, $X = [x_1, x_2, \ldots, x_T]$; $x_j \in \mathcal{R}^{D}$, $X \in \mathcal{R}^{D \times T}$. 

We introduce two Boolean LVs, one to denote the active speakers ($U_j \in \mathcal{B}^{M \times 1}$), and the other to denote the keywords uttered ($W_j \in \mathcal{B}^{N \times 1}$), for each feature vector $x_j$. $U_j(k) = 1$ iff the $k$th speaker $S_k$ is active in the $j$th frame. $W_j(l) = 1$ iff the $l$th keyword has been uttered in the $j$th frame. Modeling the conditional p.d.f. of $x_j$ given multiple active speakers as a sum of the conditional p.d.f.s of $x_j$ given individual speakers, the p.d.f. of $x_j$ is obtained as a marginal of the joint p.d.f. of $x_j$, $U_j$, and $W_j$, i.e., 

$$\Pr(x_j) = \sum_{k=1}^{M} \sum_{l=1}^{N} \Pr(U_j(k) = 1)\, \Pr(W_j(l) = 1 \mid U_j(k) = 1)\, \Pr(x_j \mid U_j(k) = 1, W_j(l) = 1) \tag{1}$$

where $\Pr(U_j(k) = 1)$ denotes the probability of the $k$th speaker being active in the $j$th frame, $\Pr(W_j(l) = 1 \mid U_j(k) = 1)$ represents the conditional probability of the $l$th keyword being uttered by the $k$th speaker in the $j$th frame, and $\Pr(x_j \mid U_j(k) = 1, W_j(l) = 1)$ denotes the probability of $x_j$ given that in the $j$th frame the $k$th speaker uttered the $l$th keyword. The two LVs are related through the conditional p.m.f. $\Pr(W_j(l) = 1 \mid U_j(k) = 1)$, as our goal is to estimate which speaker spoke which keyword. If the LVs were assumed to be independent, we would be able to decode the set of active speakers and the set of keywords uttered, but the association of each keyword with the corresponding active speaker would become a combinatorial problem. 

We use parametric speaker-specific-keyword models to compute $\Pr(x_j \mid U_j(k) = 1, W_j(l) = 1)$. Let the parameters of the model for the $k$th speaker uttering the $l$th keyword be denoted by $\lambda_{kl}$, i.e., $\Pr(x_j \mid U_j(k) = 1, W_j(l) = 1) = \Pr(x_j; \lambda_{kl})$. Let $\Lambda$ denote the collection of all speaker-specific-keyword model parameters, $\Lambda = \{\lambda_{kl};\ 1 \le k \le M,\ 1 \le l \le N\}$. Assuming that the distribution of a particular speaker and the distribution of a speaker uttering a particular keyword are time homogeneous, we represent $\Pr(U_j(k) = 1)$ by $\beta_k$ and $\Pr(W_j(l) = 1 \mid U_j(k) = 1)$ by $\delta_{kl}$. Further, assuming that in the given utterance $x[n]$ at least one of the $M$ speakers utters one of the $N$ keywords,² we have: 

$$\Pr(x_j; \Lambda) = \sum_{k=1}^{M} \beta_k \sum_{l=1}^{N} \delta_{kl}\, \Pr(x_j; \lambda_{kl}) \tag{2}$$

with $\sum_{k=1}^{M} \beta_k = 1$ and $\sum_{l=1}^{N} \delta_{kl} = 1\ \ \forall\ 1 \le k \le M$. Let $\delta = [\delta_{kl}]$, $\beta = [\beta_1, \ldots, \beta_M]$. We propose to solve for the parameters $\delta$ and $\beta$ in an ML framework using the EM algorithm [31], keeping the parameters $\Lambda$ fixed. 
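As a minimal sketch of how Eq. (2) is evaluated for a single frame, the following Python function combines the speaker weights, the keyword-given-speaker weights and the per-model likelihoods. The function name, argument names and toy numbers are illustrative assumptions; in the paper the likelihoods would come from the trained speaker-specific-keyword models.

```python
from typing import List


def frame_likelihood(beta: List[float],
                     delta: List[List[float]],
                     model_lik: List[List[float]]) -> float:
    """Evaluate the mixture of Eq. (2) for one frame x_j.

    beta[k]         -- Pr(U_j(k) = 1), sums to 1 over k
    delta[k][l]     -- Pr(W_j(l) = 1 | U_j(k) = 1), each row sums to 1
    model_lik[k][l] -- Pr(x_j; lambda_kl) from the fixed
                       speaker-specific-keyword models
    """
    total = 0.0
    for k, beta_k in enumerate(beta):
        # Inner sum over keywords for speaker k.
        inner = sum(d * p for d, p in zip(delta[k], model_lik[k]))
        total += beta_k * inner
    return total


# Toy example: M = 2 speakers, N = 3 keywords (all numbers invented).
beta = [0.5, 0.5]
delta = [[1 / 3, 1 / 3, 1 / 3],
         [1 / 3, 1 / 3, 1 / 3]]
model_lik = [[0.9, 0.3, 0.3],
             [0.3, 0.3, 0.3]]
likelihood = frame_likelihood(beta, delta, model_lik)
```

With flat weights the result is simply the average of the six model likelihoods, here 0.4.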


A. EM algorithm for estimation of $\beta$ and $\delta$ 

Let the posterior probabilities at the $m$th EM iteration be defined as: $\eta_{jk}^{(m)} = \Pr(U_j(k) = 1 \mid x_j; \lambda_{kl})$, $\gamma_{jkl}^{(m)} = \Pr(U_j(k) = 1, W_j(l) = 1 \mid x_j; \lambda_{kl})$. The Q-function, which is the conditional expectation of the log-likelihood of the overall data w.r.t. the LVs given the data, is derived as: 

$$Q(\Omega, \Omega^{(m)}; \Lambda) = \sum_{j=1}^{T} \sum_{k=1}^{M} \sum_{l=1}^{N} \gamma_{jkl}^{(m)} \left\{ \ln \beta_k + \ln(\delta_{kl}) + \ln\left(\Pr(x_j; \lambda_{kl})\right) \right\} \tag{3}$$

where $\Omega = \{\delta, \beta\}$ is the collection of parameters to be estimated and $\Omega^{(m)} = \{\delta^{(m)}, \beta^{(m)}\}$ is the collection of the parameters at the $m$th EM iteration. The EM update equations for $\eta_{jk}^{(m)}$ and $\gamma_{jkl}^{(m)}$ are given as: 

$$\eta_{jk}^{(m)} = \frac{\beta_k^{(m)} \sum_{l=1}^{N} \delta_{kl}^{(m)}\, \Pr(x_j; \lambda_{kl})}{\sum_{k_1=1}^{M} \beta_{k_1}^{(m)} \sum_{l_1=1}^{N} \delta_{k_1 l_1}^{(m)}\, \Pr(x_j; \lambda_{k_1 l_1})} \tag{4}$$

$$\gamma_{jkl}^{(m)} = \frac{\beta_k^{(m)}\, \delta_{kl}^{(m)}\, \Pr(x_j; \lambda_{kl})}{\sum_{k_1=1}^{M} \sum_{l_1=1}^{N} \beta_{k_1}^{(m)}\, \delta_{k_1 l_1}^{(m)}\, \Pr(x_j; \lambda_{k_1 l_1})} \tag{5}$$

The parameters $\delta^{(m)}$ and $\beta^{(m)}$ are updated as: 

$$\beta_k^{(m+1)} = \arg\max_{\beta_k}\ Q(\Omega, \Omega^{(m)}; \Lambda); \quad \text{s.t.}\ \sum_{k=1}^{M} \beta_k = 1 \tag{6}$$

$$\delta_{kl}^{(m+1)} = \arg\max_{\delta_{kl}}\ Q(\Omega, \Omega^{(m)}; \Lambda); \quad \text{s.t.}\ \sum_{l=1}^{N} \delta_{kl} = 1 \tag{7}$$
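The E-step posteriors and the constrained maximisations of Eqs. (6) and (7) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the closed-form M-step used below ($\beta_k = \frac{1}{T}\sum_j \eta_{jk}$ and $\delta_{kl} = \sum_j \gamma_{jkl} / \sum_j \eta_{jk}$) is the standard Lagrange-multiplier solution and is not written out in the text; the function names, the fixed iteration count and the toy likelihoods are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def em_step(beta: List[float], delta: Matrix,
            lik: List[Matrix]) -> Tuple[List[float], Matrix]:
    """One EM iteration; lik[j][k][l] = Pr(x_j; lambda_kl), kept fixed.

    Assumes every frame has at least one strictly positive likelihood.
    """
    T, M, N = len(lik), len(beta), len(delta[0])
    eta_sum = [0.0] * M                        # sum over frames of eta_jk
    gamma_sum = [[0.0] * N for _ in range(M)]  # sum over frames of gamma_jkl
    for j in range(T):
        # E-step, Eq. (5): joint posterior of (speaker k, keyword l).
        joint = [[beta[k] * delta[k][l] * lik[j][k][l] for l in range(N)]
                 for k in range(M)]
        norm = sum(map(sum, joint))
        for k in range(M):
            for l in range(N):
                g = joint[k][l] / norm
                gamma_sum[k][l] += g
                eta_sum[k] += g   # Eq. (4): eta_jk is gamma_jkl summed over l
    # M-step: closed-form maximisers of Q under the sum-to-one constraints.
    new_beta = [eta_sum[k] / T for k in range(M)]
    new_delta = [[gamma_sum[k][l] / eta_sum[k] if eta_sum[k] > 0 else delta[k][l]
                  for l in range(N)] for k in range(M)]
    return new_beta, new_delta


def run_em(lik: List[Matrix], n_iter: int = 50) -> Tuple[List[float], Matrix]:
    """Run EM from a flat initialisation for a fixed number of iterations."""
    M, N = len(lik[0]), len(lik[0][0])
    beta = [1.0 / M] * M
    delta = [[1.0 / N] * N for _ in range(M)]
    for _ in range(n_iter):
        beta, delta = em_step(beta, delta, lik)
    return beta, delta


# Toy data (invented): two frame types, M = 2 speakers, N = 2 keywords.
frame_a = [[0.01, 1.00], [0.01, 0.01]]   # speaker 0 / keyword 1 dominates
frame_b = [[0.01, 0.01], [1.00, 0.01]]   # speaker 1 / keyword 0 dominates
beta_hat, delta_hat = run_em([frame_a, frame_b, frame_a, frame_b])
```

On this toy input both speakers receive equal weight, and each speaker's conditional mass concentrates on the keyword that dominates that speaker's frames.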


² Out-of-vocabulary (OOV) keywords can be handled using a garbage model for each speaker. Effectively, this increases the vocabulary size to N + 1. However, the choice of a suitable garbage model to effectively detect the OOV words is beyond the scope of this paper. 
TABLE I: Keywords 

| Index | Key-Phrase | Index | Key-Phrase |
|---|---|---|---|
| 1. | Answer | 6. | Music |
| 2. | Disconnect | 7. | Number |
| 3. | Emergency | 8. | Outside |
| 4. | Hello | 9. | Television |
| 5. | Inside | 10. | Volume |


After convergence of the EM algorithm, let the values of $\beta$ and $\delta$ be denoted as $\beta^* = [\beta_1^*, \ldots, \beta_M^*]$ and $\delta^* = \{\delta_{kl}^*;\ 1 \le k \le M,\ 1 \le l \le N\}$. Since $\delta_{kl}^*$ is the conditional p.m.f. of the $l$th word given the $k$th speaker, we obtain the joint probability matrix ($JPM$), to detect the speaker-keyword pairs, as: $JPM(k, l) = \beta_k^* \delta_{kl}^*$. We first pick the active speakers using $\beta^*$, and then pick only one peak in the row of $JPM(k, l)$ corresponding to each active speaker. From the physics of the problem, although multiple speakers can utter the same keyword, a single speaker cannot utter multiple keywords at the same time. Thus, we pick only one peak in the active-speaker rows of $JPM(k, l)$. This is also the reason for choosing the dependency of $W_j(l)$ on $U_j(k)$ in Eq. (1), and not the other way round. 
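The decoding rule just described can be sketched as follows. The text does not say how active speakers are selected from $\beta^*$, so this sketch assumes the number of active speakers is known and takes the largest entries; that selection rule, the function name and the toy values are assumptions made for illustration. Indices are zero-based.

```python
from typing import List, Tuple


def decode_pairs(beta_star: List[float],
                 delta_star: List[List[float]],
                 n_active: int) -> List[Tuple[int, int]]:
    """Return (speaker, keyword) index pairs read off the JPM.

    n_active is assumed known; choosing the n_active largest beta*
    entries is an illustrative rule, not one stated in the text.
    """
    M = len(beta_star)
    # JPM(k, l) = beta*_k * delta*_kl
    jpm = [[beta_star[k] * d for d in delta_star[k]] for k in range(M)]
    # Active speakers: indices of the n_active largest beta* values.
    active = sorted(range(M), key=lambda k: beta_star[k], reverse=True)[:n_active]
    pairs = []
    for k in sorted(active):
        row = jpm[k]
        # Exactly one peak per active-speaker row.
        best_l = max(range(len(row)), key=row.__getitem__)
        pairs.append((k, best_l))
    return pairs


# Toy converged estimates (invented): 3 speakers, 3 keywords.
beta_star = [0.05, 0.50, 0.45]
delta_star = [[0.4, 0.3, 0.3],
              [0.1, 0.8, 0.1],
              [0.7, 0.2, 0.1]]
pairs = decode_pairs(beta_star, delta_star, n_active=2)
```

Here speakers 1 and 2 are selected as active and paired with keywords 1 and 0 respectively. Because a separate peak is taken in each active row, two speakers may legitimately be assigned the same keyword.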

We refer to this proposed framework as latent variable based 
detection of speakers and keywords or LVDSK in short. 

1) Using Prior Knowledge: For the estimation of $\beta^{(m+1)}$ and $\delta^{(m+1)}$, the EM algorithm requires initial estimates $\beta^{(0)}$ and $\delta^{(0)}$. With no prior knowledge, a flat initialization is used with: $\beta_k^{(0)} = \frac{1}{M}\ \ \forall\ 1 \le k \le M$, and $\delta_{kl}^{(0)} = \frac{1}{N}\ \ \forall\ 1 \le k \le M;\ 1 \le l \le N$. However, LVDSK can effectively incorporate any prior knowledge of either the active speakers or the keywords uttered, in order to estimate the unknown better. Let $M^*$ denote the number of active speakers. If the active speakers are known a priori, then $\beta_k^{(0)}$ is set to $\frac{1}{M^*}$ for active speaker indices and 0 for passive speaker indices, and $\delta_{kl}^{(0)}$ is set to have equal mass ($= \frac{1}{N}$) for all keywords of active speakers and 0 for all keywords of passive speakers. If the keywords uttered are known a priori, then a flat initialization is used for $\beta_k^{(0)}$, and $\delta_{kl}^{(0)}$ is set to have equal mass for the keywords uttered, for all speakers, and 0 for other keywords of all speakers. In Sec. III-C2 we show that, with such prior knowledge, LVDSK estimates the remaining unknown quantities better. 
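The three initializations described above can be sketched as one helper. The helper name and its arguments are hypothetical, indices are zero-based, and the value used for the known-keyword case (one over the number of distinct known keywords, so that each row still sums to one) is an assumption of this sketch.

```python
from typing import List, Optional, Sequence, Tuple


def initialise(M: int, N: int,
               active_speakers: Optional[Sequence[int]] = None,
               known_keywords: Optional[Sequence[int]] = None
               ) -> Tuple[List[float], List[List[float]]]:
    """Build beta^(0) and delta^(0) for M speakers and N keywords."""
    if active_speakers is None:
        # No speaker knowledge: flat beta, every speaker keeps keyword mass.
        beta0 = [1.0 / M] * M
        speakers_with_mass = set(range(M))
    else:
        # Known speakers: mass 1/M* on active indices, 0 on passive ones.
        speakers_with_mass = set(active_speakers)
        beta0 = [1.0 / len(speakers_with_mass) if k in speakers_with_mass else 0.0
                 for k in range(M)]
    if known_keywords is None:
        row = [1.0 / N] * N
    else:
        kws = set(known_keywords)
        # Assumption: equal mass 1/|kws| so that each row sums to one.
        row = [1.0 / len(kws) if l in kws else 0.0 for l in range(N)]
    # Passive speakers get all-zero keyword rows.
    delta0 = [list(row) if k in speakers_with_mass else [0.0] * N
              for k in range(M)]
    return beta0, delta0


flat_beta, flat_delta = initialise(4, 5)
spk_beta, spk_delta = initialise(4, 5, active_speakers=[1, 3])
kw_beta, kw_delta = initialise(4, 5, known_keywords=[0, 2])
```

Since the posteriors in Eqs. (4) and (5) are proportional to $\beta_k \delta_{kl}$, any entry initialized to zero stays at zero in every later iteration, which is how the prior knowledge constrains the estimate.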



III. Performance Evaluation 
A. Database, Features and Performance Measures 

We have created our own database in the context of HMI in a home environment. Six male speakers and four female speakers each utter the ten keywords shown in Table I, with 10 repetitions each. The keywords are chosen carefully to ensure that the vocabulary contains long distinct words (“Emergency”, “Disconnect”, “Television”, “Volume”), shorter distinct words (“Hello”, “Music”), moderately distinct words (“Answer”, “Number”) and confusable words (“Inside” and “Outside”). The utterances are recorded using an omni-directional microphone, the Audio-Technica AT8004L (for specifications, cf. [32]), which has a flat frequency response between 80 Hz and 16 kHz. Microphone signals are sampled at 16 kHz with 16 bits per sample in an anechoic chamber. 

If the utterances of the two speakers are well separated in time, then even very simple keyword spotting techniques suffice to build a very accurate detection system. However, the problem becomes more difficult when the utterances overlap in time, so that one keyword masks the other. In order to study this harder problem, albeit unrealistic, the test data is generated by adding the utterances from two different speakers with an overlap of more than 90%. In the considered context, since there is no notion of a target and a masker, the measure of SIR is not relevant for characterizing the mixture signals. Therefore, we employ a related measure, the Relative Power Ratio (RPR), defined as the ratio of the powers of the individual speakers in a mixture signal. All the mixture signals generated have an RPR close to 0 dB. An example mixture signal with speech from the two speakers is shown in Fig. 1. An example set of test data is available at https://sites.google.com/site/harshas123/downloads 

Fig. 1: (Color Online): A mixture signal with 2 speakers. 

The performance of the proposed framework is studied under two categories: Multiple speakers uttering different keywords (MSpDKW) and Multiple speakers uttering the same keywords (MSpSKW). With 10 speakers and 10 keywords, there are $\binom{10}{2} \times \binom{10}{2} = 2025$ possible combinations of different speakers uttering different keywords and $\binom{10}{2} \times 10 = 450$ possible combinations of different speakers uttering the same keyword. We therefore generate 2025 mixture signals for the MSpDKW task and 450 signals for the MSpSKW task. Thirty-eight-dimensional Mel Frequency Cepstral Coefficients (MFCCs) along with Delta and Acceleration coefficients, obtained with a frame length of 20 ms and a frame shift of 10 ms, are used as features (omitting the energy component of the MFCC). 

LVDSK involves detection of speakers, keywords and the speaker-keyword pairs. To assess the performance of LVDSK, we compute the average percentage recognition of one or both of each of the three entities (speakers, keywords and speaker-keyword pairs) in a leave-one-out cross-validation setting. 
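The task sizes quoted in this subsection can be checked by direct enumeration, and the RPR of a mixture can be computed from its two source signals. The sketch below is illustrative: the dB form of the RPR (ten times the base-10 logarithm of the ratio of mean powers) is an assumption consistent with the "0 dB" figure in the text, and the function name is hypothetical.

```python
import math
from itertools import combinations
from typing import Sequence

n_speakers, n_keywords = 10, 10

# Unordered pairs of distinct speakers and of distinct keywords.
speaker_pairs = list(combinations(range(n_speakers), 2))
keyword_pairs = list(combinations(range(n_keywords), 2))

n_mspdkw = len(speaker_pairs) * len(keyword_pairs)  # different keywords
n_mspskw = len(speaker_pairs) * n_keywords          # same keyword


def relative_power_ratio_db(s1: Sequence[float], s2: Sequence[float]) -> float:
    """RPR in dB between two source signals (assumed definition)."""
    p1 = sum(v * v for v in s1) / len(s1)   # mean power of source 1
    p2 = sum(v * v for v in s2) / len(s2)   # mean power of source 2
    return 10.0 * math.log10(p1 / p2)


# Two toy signals with equal power give an RPR of 0 dB.
rpr_equal = relative_power_ratio_db([1.0, -1.0, 1.0, -1.0],
                                    [1.0, 1.0, -1.0, -1.0])
```

The enumeration gives 45 speaker pairs and 45 keyword pairs, hence 2025 MSpDKW mixtures and 450 MSpSKW mixtures.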


B. LVDSK with different speaker specific keyword models 

Although any stochastic model can be used, we explore the use of two different speaker-specific-keyword models: Gaussian Mixture Models (GMMs) and Student’s-t Mixture Models (tMMs). The parameters $\lambda_{kl}$ for both GMMs and tMMs are obtained using the clean speech utterances of the $l$th keyword by the $k$th speaker ($S_k$) in an ML framework using an EM algorithm (for GMMs, cf. [33]; for tMMs, cf. [34]). For both GMMs and tMMs, 8 mixture components are used. Table II tabulates the performance of GMMs and tMMs as speaker-specific-keyword models in the LVDSK framework. Although GMMs and tMMs have nearly comparable performance in detecting at least one entity correctly, tMMs outperform GMMs in detecting both the entities, be it speakers, keywords or speaker-keyword pairs. Henceforth, we use tMMs as speaker-specific-keyword models in all the experiments to follow. 

TABLE II: Overall % recognition accuracy of LVDSK with GMMs and tMMs as speaker-specific-keyword models. 

| Model | At least 1 speaker detected correctly | Both speakers detected correctly | At least 1 phrase detected correctly | Both phrases detected correctly | At least 1 speaker-phrase detected correctly | Both speaker-phrases detected correctly |
|---|---|---|---|---|---|---|
| LVDSK-tMM | 99.99 | 91.80 | 99.80 | 83.56 | 99.64 | 81.66 |
| LVDSK-GMM | 99.40 | 70 | 97.70 | 61 | 96.80 | 57.10 |

TABLE III: Average % recognition accuracy of LVDSK in different tasks, MSpDKW and MSpSKW, with flat initialization. 

| Task (No. of Utterances) | At least 1 speaker detected correctly | Both speakers detected correctly | At least 1 phrase detected correctly | Both phrases detected correctly | At least 1 speaker-phrase detected correctly | Both speaker-phrases detected correctly |
|---|---|---|---|---|---|---|
| MSpDKW (2025) | 99.99 | 91.33 | 99.80 | 81.80 | 99.65 | 80.50 |
| MSpSKW (450) | 99.97 | 93.86 | 99.71 | 91.50 | 99.60 | 87.13 |
| Overall (2475) | 99.99 | 91.80 | 99.80 | 83.56 | 99.64 | 81.66 |

TABLE IV: Overall % recognition accuracy of LVDSK with prior knowledge. 

| Initialization | At least 1 speaker detected correctly | Both speakers detected correctly | At least 1 phrase detected correctly | Both phrases detected correctly | At least 1 speaker-phrase detected correctly | Both speaker-phrases detected correctly |
|---|---|---|---|---|---|---|
| LVDSK-Flat | 99.99 | 91.80 | 99.80 | 83.56 | 99.64 | 81.66 |
| Oracle-SpID | 100 | 100 | 99.81 | 85.22 | 99.74 | 85.22 |
| Oracle-KWID | 100 | 92.3 | 100 | 100 | 100 | 92.3 |

Fig. 2: (Color Online): Performance on mixture speech signal with 2 active speakers. Panels: (a) ground truth of who spoke what; (b) $JPM(k, l)$; (c) $\beta_k$ vs. speaker index $k$. 
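The two model families compared in this subsection differ only in their component densities. The sketch below contrasts a Gaussian and a Student's-t component in one dimension; it is a deliberately simplified illustration, whereas the models in the paper are multivariate 8-component mixtures fitted as in [33] and [34]. The function names, the choice of 3 degrees of freedom and the evaluation points are assumptions. The heavier tails of the t density are a well-known property; whether they account for the gap in Table II is not established in the text.

```python
import math


def gaussian_pdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Univariate Gaussian density."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def student_t_pdf(x: float, mu: float = 0.0, sigma: float = 1.0,
                  nu: float = 3.0) -> float:
    """Location-scale Student's-t density with nu degrees of freedom."""
    z = (x - mu) / sigma
    # Log of the normalising constant, computed with lgamma for stability.
    log_norm = (math.lgamma((nu + 1.0) / 2.0) - math.lgamma(nu / 2.0)
                - 0.5 * math.log(nu * math.pi) - math.log(sigma))
    return math.exp(log_norm - 0.5 * (nu + 1.0) * math.log1p(z * z / nu))


def mixture_pdf(x, weights, components):
    """Weighted sum of component densities; components are callables."""
    return sum(w * c(x) for w, c in zip(weights, components))


# Density assigned to a point five scale units from the mean.
tail_gauss = gaussian_pdf(5.0)
tail_t = student_t_pdf(5.0)
```

At five scale units from the mean the t component assigns a density several orders of magnitude larger than the Gaussian one, so isolated outlying frames are penalised far less severely.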


C. Results 


As an illustration, Fig. 2 shows the plots of the ground truth of who spoke what (Fig. 2a), $JPM(k, l)$ (Fig. 2b) and $\beta$ (Fig. 2c) on a sample scalar mixture with two active speakers ($S_2$ and $S_9$) speaking simultaneously. From Fig. 2b we see that LVDSK gives higher probability values for the correct speaker speaking the correct word and lower probability values for other combinations. It can also be seen that the $\beta$ values (Fig. 2c) yield higher probability for the active speakers and lower probability for the inactive speakers. Thus the active speakers are accurately identified in the plot of $\beta$. 


1) Performance under different Tasks: Table III tabulates the percentage recognition accuracies for the two tasks and the overall performance. LVDSK has near-perfect performance for recognition of at least one of the two entities (speaker, keyword or speaker-keyword pair). The performance of recognizing both the speakers is higher in the MSpSKW task than in the MSpDKW task. This is intuitively reasonable because one would expect to differentiate speakers more easily when both are uttering the same content than when they are uttering different content. 

With just simple mixture models, the recognition accuracy for both the keywords is more than 80%. In the MSpSKW task, in which the same keyword is uttered by both speakers, LVDSK performs quite well, with more than 90% recognition accuracy for detecting both the keywords, even though the confusability is higher than in the MSpDKW task. This is because most of the errors in the MSpDKW task occur when a shorter keyword, like “Hello” or “Music”, is completely embedded within a longer keyword like “Emergency” or “Disconnect”. Another source of error observed is the confusion between the words “Inside” and “Outside” when the initial part of these words is masked by another keyword. These scenarios never occur in the MSpSKW task, as the keywords are roughly of the same length and spectral content. The words “Answer” and “Number” are usually accurately detected, except when their initial part is masked. As expected, the longer distinct keywords are the least confused. 

If the speakers and the keywords are correctly detected but wrongly mapped to each other, then the recognition of both speaker-keyword pairs will be incorrect. Thus, one can expect the accuracy for detecting both speaker-keyword pairs to be lower than the lesser of the recognition accuracies for detecting both speakers and both keywords. The overall performance is computed by combining the MSpDKW and the MSpSKW tasks. Since the MSpDKW task has more utterances, the overall performance of the LVDSK framework is closer to that of the MSpDKW task than to that of the MSpSKW task. 
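The pull of the larger task on the overall figure can be illustrated with an utterance-weighted average. The sketch assumes the overall accuracy behaves like such a weighted average of the two task accuracies; the text only says the tasks are combined, so this is an approximation, and the function name is hypothetical.

```python
def pooled_accuracy(acc_a: float, n_a: int, acc_b: float, n_b: int) -> float:
    """Utterance-weighted average of two per-task accuracies (in %)."""
    return (acc_a * n_a + acc_b * n_b) / (n_a + n_b)


# "Both speakers detected correctly" column of Table III:
# MSpDKW 91.33% on 2025 utterances, MSpSKW 93.86% on 450 utterances.
overall_both_speakers = pooled_accuracy(91.33, 2025, 93.86, 450)
```

For this column the pooled value is about 91.79, close to the 91.80 reported as overall, and much nearer to the MSpDKW figure than to the MSpSKW one. The small residual difference suggests the reported overall values were computed from unrounded results.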

In general, the speaker recognition accuracy is higher than the keyword recognition accuracy owing to the use of mixture models (and the assumption of i.i.d. features, thereof). With more sophisticated keyword models (e.g., Long Short Term Memory (LSTM) recurrent neural networks [35]), it may be possible to achieve higher accuracy for recognizing both the keywords. Further investigation into a suitable choice of speaker-specific-keyword models is warranted, but is beyond the scope of this paper. Importantly, any such model (e.g., an LSTM network) can be used in the proposed framework. 

2) Performance with Prior Knowledge: Let the scenario in which the prior knowledge of active speakers is used in the LVDSK framework (as shown in Sec. II-A1) be referred to as Oracle speaker identities (Oracle-SpID). Similarly, the scenario in which the prior knowledge of keywords is incorporated into LVDSK is referred to as Oracle keyword identities (Oracle-KWID). Table IV tabulates the performance of LVDSK under the Oracle-SpID and Oracle-KWID scenarios. Clearly, using the prior knowledge of the active speakers boosts the recognition accuracy of keywords and hence of the speaker-keyword pairs. Similarly, using the prior knowledge of keywords boosts the recognition accuracy of active speakers. It is observed that the prior knowledge of keywords is more informative to the LVDSK framework than the prior knowledge of the speakers. This can be attributed to the mixture models being better at modeling speakers than keywords. 

IV. Conclusions 

In this paper we have proposed an LV framework to address the problem of detecting multiple speakers and their keywords in a multi-speaker mixture signal. The proposed framework is generic enough to incorporate any kind of speaker-specific-keyword model. Analysis of LVDSK with GMMs and tMMs as speaker-specific-keyword models showed the superior performance of tMMs over GMMs. LVDSK also offers an elegant way to incorporate prior knowledge about the speakers and/or the keywords. With accurate prior knowledge of the keywords, a significant improvement in the recognition of speaker-keyword pairs is achieved. 